Construction and validation of a prognostic model for tongue cancer based on three genes signature

Tongue squamous cell carcinoma (TSCC) has a poor prognosis and destructive characteristics. Reliable biomarkers are urgently required to predict disease outcomes and to guide TSCC treatment. This study aimed to develop a multigene signature and prognostic nomogram that can accurately predict the prognosis of patients with TSCC. We screened differentially expressed genes associated with TSCC using The Cancer Genome Atlas dataset. Based on this, we developed a new multi-mRNA gene signature using univariate Cox regression, Least Absolute Shrinkage and Selection Operator regression, and multivariate Cox regression. We used the concordance index to evaluate the accuracy of this new multigene model. Moreover, we performed receiver operating characteristic and Kaplan–Meier survival analyses to assess the predictive ability of the new multigene model. In addition, we created a prognostic nomogram incorporating clinical and pathological characteristics, with the aim of enhancing the adaptability of this model in practical clinical settings. We successfully developed a new prognostic model based on the expression levels of these 3 mRNAs that can be used to predict the prognosis of patients with TSCC. This prediction model includes 3 genes: KRT33B, CDKN2A, and CA9. In the validation set, the concordance index of this model was 0.851, and the area under the curve was 0.778 and 0.821 in the training and validation sets, respectively. Kaplan–Meier survival analysis showed that regardless of whether it was in the training or validation set, the prognosis of high-risk patients was significantly worse than that of low-risk patients (P < .001). Multivariate Cox regression analysis revealed that this model was an independent prognostic factor for patients with TSCC (P < .001). Our study suggests that this 3-gene signature model has a high level of accuracy and predictive ability, is closely related to the overall survival rate of patients with TSCC, and can independently predict the prognosis of TSCC patients with high accuracy and predictive ability.


Introduction
Tongue squamous cell carcinoma (TSCC) is the most common type of cancer in the oral cavity, accounting for over 90% of all malignant tumors in the oral cavity. [1,2]The primary treatment options for tongue cancer include surgery, conventional radiotherapy, and chemotherapy.Despite advances in the treatment of TSCC, the overall 5-year survival rate remains low. [3]herefore, there is an urgent need to develop new prognostic models to identify high-risk patients with TSCC and to provide personalized treatment plans.In addition, an accurate and quantitative prognostic model is necessary to assist patients with consultation.
Increasing evidence suggests that the abnormal expression of certain messenger RNAs (mRNAs) is closely related to the prognosis of patients with TSCC, and these mRNAs can be used as molecular biomarkers for the prognostic assessment and identification of potential high-risk patients.[6][7][8] However, the prognostic features and nomographs of multiple mRNA in TSCC patients have not been sufficiently explored.
With the development of second-generation sequencing technology, attention has gradually turned to the analysis and alignment of high-throughput sequencing data, leading to the establishment of The Cancer Genome Atlas (TCGA) database, which provides an avenue for tumor-related research through the mining and analysis of genomic data on cancer genes.Initiated and completed by the National Cancer Institute and the National Human Genome Research Institute in the United States, TCGA is a genomic variation map of human tumors obtained through large-scale sequencing of the human tumor genome, containing information on genomics, transcriptomics, epigenetics, proteomics, as well as clinical information.It has been shown that there are significant differences in cancer-related gene expression between TSCC samples and normal tongue tissue samples. [9]Therefore, we employed various methods, including univariate Cox regression, Least Absolute Shrinkage and Selection Operator (LASSO) regression, and multivariate Cox regression, to establish a prognostic model based on multiple mRNAs that could reliably and accurately predict the prognosis of TSCC patients.The accuracy of the model was evaluated by testing its performance and predicting the patient prognosis.

Data download and extraction
We downloaded normal and control samples for oral tongue cancer and the base of tongue cancer from the official website of TCGA at https://portal.gdc.cancer.gov/.To select the appropriate cases, we choose either "other and unspecified parts of the tongue" or "base of tongue" from the "Primary Site" menu.We selected "TCGA" as the program and "TCGA-HNSC" as the project.To minimize the impact of race on the experimental conclusions as much as possible, we only extracted information from the Caucasian population.A total of 152 samples, consisting of 137 tumor samples and 15 normal control samples, were downloaded.Each sample was analyzed for the expression of 59,427 transcripts, including both encoded and non-encoded transcripts.

Model construction
We used the Perl program to extract the necessary raw data and utilized the R language "limma" package for differential analysis.The extracted data type is transcripts per million (tpm), which represents the number of transcripts per million mRNA molecules in a sample.The criteria for differential gene screening were an absolute |logFC| should be >2 and FDR should be <0.01.Remove genes with an average expression value below 0.5 tpm.These differentially expressed genes (DEGs) could be considered as candidate genes for further regression analysis and prognosis model establishment.We used the "pacman" and "limma" packages in R to merge the DEGs and clinical data.The "caret" package in R was used to randomly divide the TCGA data into training and validation sets at a 1:1 ratio.In the training set, the "survival" package in R was used for univariate Cox regression analysis to obtain DEGs related to overall survival (OS), with a filtering condition of P < .1.The "glmnet" and "survival" packages in R were used to screen the prognosis genes by LASSO regression analysis and visualize the data to avoid overfitting.Subsequently, the "survival" and "survminer" packages in R were used for backward multivariate Cox hazard regression analysis and data visualization to select the optimal model that could predict the OS of TSCC.Finally, we calculated the mRNA-based prognostic risk score by combining the expression values of prognostic genes and their regression coefficients in the multivariate Cox regression model (backward).The formula for the risk score was as follows: In this study, "i" represents the gene in the model, "n" represents the number of genes in the model, "Exp i " represents the expression value of gene "i," and "B i " represents the regression coefficient of gene "i" in multivariate Cox regression analysis.The training and validation sets of patients with TSCC were divided into high-and low-risk groups, based on the median risk score.

Verification of model accuracy
The concordance index (c-index) for the model in the validation set was calculated using the "survival" package in R, incorporating both the risk score and clinicopathological features.

Assessment of model prognostic ability
Time-dependent receiver operating characteristic (ROC) analysis of the training and validation sets was performed using the Kaplan-Meier "survivalROC" package in R. ROC curves were plotted, and the area under the curve (AUC) was calculated.Kaplan-Meier survival analysis and curves were plotted in R using the plan "Meier survival" package for both patient sets.Heat maps for risk stratification were generated using the "pheatmap" package in R. The distribution of risk scores and corresponding survival curves were plotted in R using the base package.analysis (backward) and data visualization were performed using R packages "survival" and "survminer" to select features for constructing a forest plot.Subsequently, a forest plot predicting OS of patients with TSCC was constructed based on the selected features.

Statistical analysis
Unless otherwise specified, all significance tests were 2-tailed, and a P value <0.05 was considered statistically significant.

Clinical characteristics of TSCC patients in TCGA
We collected data on 137 patients from TCGA database, including 94 males and 43 females.The median age was 60 years (range 19-87), with 39 patients aged 65 years or older.According to the AJCC staging system, there were 12, 24, 32, and 69 cases of stages I and II disease, 32 cases of stage III disease, and 69 cases of stage IV disease, respectively.The primary tumor sites were the oral tongue in 108 patients, the base of the tongue in 25 patients, and the floor of the mouth in 4 patients.Lymph node involvement was observed in 69 patients.No distant metastases were detected in any patient.

Identifying genes with differential expression compared to normal tissue
DEGs were identified and compared to normal tissue using the "limma" package under the conditions specified in Section 2. After filtering out genes with average expression below 0.5 tpm, a total of 1009 genes remained and were selected for further regression analysis.Integration of RNA expression data and clinicopathological features was conducted using the "pacman" and "limma" packages in the R language.The cases were randomly divided into training and testing sets at a ratio of 1:1, according to their clinicopathological characteristics (Table 1).

Univariate Cox regression analysis of DEGs
We used the Kaplan-Meier "survival" package in R to conduct univariate Cox regression analysis on the training set to evaluate the role of candidate DEGs in TSCC prognosis.Among the 1009 genes in the training set, 46 were associated with the prognosis (Table 2).

A prognostic model containing 3 genes was established using the LASSO regression and stepwise multivariate Cox hazard ratio regression methods
Incorporating the 46 genes selected by univariate Cox regression analysis, LASSO regression analysis was applied by regularization with a penalty term to avoid overfitting and select the most important features.The regularization parameter lambda was set to log lambda.min.Four genes, KRT33B, cyclin dependent kinase inhibitor 2A (CDKN2A), carbonic anhydrase 9 (CA9), and ITGA5, were ultimately chosen as statistically significant (Fig. 1) and incorporated into the multivariate Cox regression model.Subsequently, a prognostic model including 3 genes, KRT33B, CDKN2A, and CA9, was established (Fig. 2) based on their expression levels and their association with prognosis in TSCC, which were depicted in the Figure S1, Supplemental Digital Content, http://links.lww.com/MD/K753 and Figure S2, Supplemental Digital Content, http://links.lww.com/MD/K754.Risk Score = −4.605689081× KRT33B expression − 0.054674985 × CDKN2A expression + 0.022976245 × CA9 expression

Validation of model accuracy and predictive ability
The risk score of each patient in the training and validation sets was calculated according to the risk score formula, and patients in each set were divided into high-and low-risk groups based on the median risk score.The c-index of the validation set was 0.851, indicating high accuracy of the model.KM survival analysis showed that patients with high-risk scores had a significantly worse prognosis than those with low-risk scores in both the training and validation sets (Fig. 3A and B, P < .001).The 5-year survival rate for high-risk individuals in the training set was 23.6%, whereas it was 72.2% for the low-risk group (P = .001).In the validation set, the 5-year survival rate for high-risk individuals was 40.5%, compared to 66.6% for the low-risk group (P = .001).The ROC analysis showed that the AUC in the training and validation sets were 0.821 and 0.778, respectively (Fig. 4A and B).This indicates that the model has a high predictive ability and accuracy.Through the risk heatmap of the training and validation sets (Fig. 5), we found significant differences in the expression levels of the 3 genes between the high-risk and low-risk groups.The risk score distribution chart (Fig. 6) shows that the risk score was linearly distributed.The risk score survival chart (Fig. 7) showed that the mortality risk of the high scorers significantly increased.These results suggest that our predictive model has a high predictive ability and stability.

Clinical application of the model
Patients were divided into 4 risk groups according to quartiles of the risk score: extremely low-risk, low-risk, high-risk, and extremely high-risk.Based on the risk group and clinicopathological features, a multivariate Cox regression analysis (backward) was performed on TSCC patients, and 3 risk factors for TSCC were ultimately identified: grade, T, and risk score (Table 3).All 3 risk factors had 4 grades each (Grade 1-4, T 1-4, 4 risk score levels).Compared to Grade and T, our model has an HR of 2.325, which is significantly higher than the other 2 risk factors, indicating that our model has a greater advantage in predicting prognosis compared to Grade and T. A survival curve was constructed to predict the OS of patients with TSCC using these 3 risk factors (Fig. 8), thus providing a valuable tool for clinical decision-making.

Discussion and conclusion
We established a prognostic model based on only 3 genes to predict the prognosis of patients with TSCC.The predictive power of this model was stronger than that of the other clinicopathological factors, with an HR of 2.325 (95% confidence interval: 1.715-3.152).To evaluate the accuracy and reliability of this model, we conducted c-index tests, ROC analyses, and KM analyses, all of which showed that the model had good predictive accuracy.Therefore, we believe that this 3-gene model can effectively predict the prognosis of TSCC patients.Furthermore, to make this model more intuitive and practical, we created a column chart integrating risk assessment, grading, and T staging to predict the prognosis.
Our model includes 3 genes: KRT33B, CDKN2A, and CA9.KRT33B, also known as keratin 33 B, is a member of the keratin family.In normal tissues, keratin encoded by this gene promotes the growth and maintenance of hair and nails.[12][13] Studies have shown that KRT33B is significantly suppressed in low-grade gliomas and is associated with poor prognosis, [14] suggesting that this protein may be a marker for differentiation status, with high expression indicating better differentiation and prognosis.This is consistent with our results (Figure S1A, Supplemental Digital Content, http://links.lww.com/MD/K753 and Figure S2A, Supplemental Digital Content, http://links.lww.com/MD/K754), where patients with low KRT33B expression had significantly worse outcomes than those with high expression.CDKN2A, also known as cyclin dependent kinase inhibitor 2A/ p16, plays an important role in the occurrence and development of head and neck squamous cell carcinoma (HNSCC) and is a well-known tumor suppressor gene, particularly in HPV-related HNSCC. [15,16]Our study also found that low CDKN2A expression in TSCC was significantly associated with a poor prognosis (Figure S1B, Supplemental Digital Content, http://links.lww.com/MD/K753 and Figure S2B, Supplemental Digital Content, http://links.lww.com/MD/K754),indicating the importance of CDKN2A in TSCC.CAs are a class of large zinc metalloenzymes that catalyze the reversible hydration of carbon dioxide and participate in a variety of biological processes.CA9, also known as carbonic anhydrase 9, is one of the only tumor-associated isozymes in the CA family. [17]A study on gliomas found that targeting this gene can effectively inhibit tumor growth. [18]ang et al found that high CA9 expression in TSCC is closely related to poor pathological T staging and warrants further exploration as a potential prognostic biomarker and therapeutic target for OTSCC. [19]Guan et al [9] also found that CA9 plays an important role in TSCC prognosis and tumor grading and that its expression is closely related to the regulation of cell differentiation, multiple oncogenes, and pathways related to cancer.
Our research found that CA9 is also related to the prognosis of TSCC patients, with poorer prognosis in high expressers (Figure S1C, Supplemental Digital Content, http://links.lww.com/MD/K753 and Figure S2C, Supplemental Digital Content, http:// links.lww.com/MD/K754).All 3 genes mentioned above are potentially closely related to the occurrence and development of TSCC, and further in-depth research may provide a theoretical basis and significant breakthroughs for treatment and prognosis monitoring.
Gene expression prediction models have been used to predict the probability of related diseases or other phenotypes by analyzing the gene expression.Such models have demonstrated a good predictive performance and can provide personalized medical services and treatment plans for different patients, thereby contributing to improved treatment efficacy and lower costs.However, gene expression prediction models also have limitations, mainly evident in their tendency to overfit, resulting in poor predictive performance on testing datasets.In addition, gene expression-based prediction models tend to produce many gene variables that lack intuitive and easy-to-understand indicators, making it challenging to transform these variables into clinical indicators.[22] However, studies of gene expression prediction models for TSCC are limited.Montero et al [23] constructed a prognostic nomogram for patients with oral cancer, which included some important clinicopathological variables, such as age, sex, race, alcohol and smoking history, subsite of the oral lesion, invasion of other structures, tumor size, and clinical lymph node status.However, biological markers were not included in the model.The c-index of the model was only 0.67.Subsequently, Bobdey et al [24] developed a prognostic model for patients with oral cancer with a c-index of 0.7263; however, biological markers were not included.Jiang et al [25] integrated GEO and TCGA datasets to establish an OTSCC prognostic model that included 16 genes; however, the c-index of this model for predicting OS was only 0.652, suggesting a relatively low predictive ability.Recently, Liu et al established a prognostic model consisting of 15 genes with an internal validation c-index of 0.849 and an external validation c-index of 0.804.However, this model was only targeted for oral tongue cancer and did not include floor of the mouth and tongue root cancers, limiting its scope of application. [26]he above model either predicts a low performance or contains too many genes.Currently, no prediction model includes a gene signature/biomarker for all patients with TSCC.Therefore, the model we established can predict the prognosis of all tongue cancers, including tongue root and base cancer.The model contained only 3 genes closely related to cancer development, but the c-index (0.851) and AUC (0.821) reached high levels, which can accurately predict the progression of the disease and greatly improve the accuracy of clinical diagnosis.This method is expected to become a part of precision medicine.Second, our model contained only 3 genes with high operability, which greatly reduced the complexity of clinical testing and improved the stability and accuracy of the results, thereby improving the efficiency of clinical diagnosis.In the future, it will likely become an important tool for predicting TSCC prognosis of TSCC.To reduce genetic collinearity and improve the accuracy of the results, this study used LASSO stepwise regression analysis before conducting a multiple factor Cox regression to improve the predictive accuracy of the model.Finally, we established a prognostic nomogram that integrates this predictive model with clinicopathological features, making the results more intuitive, easy to understand, and easy to operate, which will help promote the application of this model in clinical practice.Our model, as the most advantageous prognostic factor, can predict the prognosis of patients with TSCC more accurately when combined with clinical and pathological data.However, owing to the lack of data on patients with distant metastasis in TCGA, the M stage/TNM stage were assigned a lower weight in the nomogram.Only T stage was included as a prognostic factor.It is widely known that TNM staging has a significant impact on the prognosis of patients, especially M status.Therefore, in future clinical validation, TNM staging/M staging should be considered when designing nomograms.Further prospective studies are required to confirm our findings.

Figure 1 .
Figure 1.LASSO regression.(A) Regression analysis using LASSO-penalized logistic regression with parameter λ adjusted by 10-fold cross-validation.(B) The sum of coefficients βi obtained by LASSO feature selection represents the coefficient sum of the LASSO model.LASSO = Least Absolute Shrinkage and Selection Operator.

Figure 2 .
Figure 2. Forest plot of the prognostic model constructed by stepwise multivariate Cox regression analysis, including 3 genes.

Figure 3 .
Figure 3. Kaplan-Meier survival curve based on risk score model of the training set: (A) training set and (B) testing set.

Figure 4 .
Figure 4. ROC analysis based on risk score model of the training set: (A) training set and (B) testing set.ROC = receiver operating characteristic.

Figure 7 .
Figure 7. Risk score survival based on risk score model of the training set: (A) training set and (B) testing set.

Figure 6 .
Figure 6.Risk score distribution based on risk score model of the training set: (A) training set and (B) testing set.

Figure 8 .
Figure 8. Column chart predicting the overall 3-year and 5-year survival of TSCC patients by integrating risk groups and clinicopathological features.TSCC, tongue squamous cell carcinoma.

Table 1
Clinical and pathological characteristics of patients included in the training and validation sets.

Table 2
The 46 significantly differentially expressed cancer-related genes (P < .1)identified by univariate Cox regression analysis.

Table 3
Three characteristics selected for constructing the survival curve by multivariable Cox regression analysis.